1. Getting started

1.1 Load required packages

First, let’s load the R packages necessary for our project.
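The loading chunk itself is not shown in this rendering; based on the functions used later in the tutorial (select, drop_na, kable, createDataPartition, CreateTableOne, hist.data.frame, train, prediction, ggplot), a plausible package list is the following sketch (adjust it to your own setup):

```r
library(dplyr)         # select(), %>%
library(tidyr)         # drop_na()
library(knitr)         # kable()
library(caret)         # createDataPartition(), train(), confusionMatrix(), varImp()
library(tableone)      # CreateTableOne(), kableone()
library(Hmisc)         # hist.data.frame()
library(ROCR)          # prediction(), performance()
library(randomForest)  # backend for method = "rf"
library(ggplot2)       # plots
```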

1.2 Load the dataset

Access the data on PhysioNet from the WiDS Datathon Project Page, and download ‘training_v2.csv’ from the “Files” section of the project.

Now we can load in our data and start to work with it.
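A minimal loading step, assuming ‘training_v2.csv’ sits in the working directory and using readr so the result is a tibble (which matches the tibble output shown later); the object name gossis matches the code below:

```r
library(readr)

# Read the GOSSIS training data downloaded from PhysioNet
gossis <- read_csv("training_v2.csv")
```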

1.3 Subset selection

We will include eight predictors in our model: age, bmi, gender, glucose_apache, h1_glucose_max, d1_glucose_max, ethnicity, and wbc_apache. Feel free to include more or choose entirely different predictors in your model!

# Select a subset of variables from the original data
reduced_data <- gossis %>%
  select(bmi, age, gender, glucose_apache, diabetes_mellitus,
         h1_glucose_max, d1_glucose_max, ethnicity, wbc_apache)
head(reduced_data) %>% kable()
| bmi   | age | gender | glucose_apache | diabetes_mellitus | h1_glucose_max | d1_glucose_max | ethnicity | wbc_apache |
|------:|----:|:-------|---------------:|------------------:|---------------:|---------------:|:----------|-----------:|
| 22.73 |  68 | M      |            168 |                 1 |             NA |            168 | Caucasian |       14.1 |
| 27.42 |  77 | F      |            145 |                 1 |            145 |            145 | Caucasian |       12.7 |
| 31.95 |  25 | F      |             NA |                 0 |             NA |             NA | Caucasian |         NA |
| 22.64 |  81 | F      |            185 |                 0 |             NA |            185 | Caucasian |        8.0 |
|    NA |  19 | M      |             NA |                 0 |             NA |             NA | Caucasian |         NA |
| 27.56 |  67 | M      |            156 |                 1 |             NA |            156 | Caucasian |       10.9 |

We will also define our outcome variable and drop rows with unknown outcomes.

# Set the outcome variable here
reduced_data$outcome_variable <- as.factor(reduced_data$diabetes_mellitus)
reduced_data <- subset(reduced_data, select = -c(diabetes_mellitus))

# Check number of rows
nrow(reduced_data)
## [1] 91713
# For simplicity, we will drop all rows with missing outcomes
reduced_data <- drop_na(reduced_data, any_of("outcome_variable"))

# Check number of rows after removing rows with missing outcomes
nrow(reduced_data)
## [1] 90998

2. Create the training and testing sets

2.1 Encoding

We’ll encode our categorical variables. Encoding converts categorical values into numeric (often binary) columns that machine learning models can work with.

# Encode gender variable: male = 1, non-male = 0
reduced_data$gender <- ifelse(reduced_data$gender == "M", 1, 0)
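A binary variable like gender needs only a single 0/1 column, but a multi-level variable such as ethnicity is usually one-hot (dummy) encoded. A self-contained toy sketch with base R's model.matrix (the data here is made up for illustration):

```r
# Toy data frame with a multi-level categorical variable
eth <- data.frame(ethnicity = c("Caucasian", "Asian", "Hispanic", "Asian"))

# "+ 0" keeps a column for every level instead of dropping a reference level
dummies <- model.matrix(~ ethnicity + 0, data = eth)

colnames(dummies)  # one indicator column per level
rowSums(dummies)   # each row has exactly one 1
```

caret's formula interface does this automatically for factor predictors, which is why the variable-importance table later lists columns like ethnicityAsian.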

2.2 Split the data

Let’s create the training and test datasets.

# Set the random number seed for reproducibility
set.seed(1)

# Create data partition using the outcome variable
train_index <- createDataPartition(reduced_data$outcome_variable,
                                   times = 1, p = 0.8, list = FALSE)

# Split data into train and test sets, select columns that will be used in model
train_set <- reduced_data[train_index, ]
head(train_set)
## # A tibble: 6 × 9
##     bmi   age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
##   <dbl> <dbl>  <dbl>          <dbl>          <dbl>          <dbl> <chr>    
## 1  22.7    68      1            168             NA            168 Caucasian
## 2  32.0    25      0             NA             NA             NA Caucasian
## 3  22.6    81      0            185             NA            185 Caucasian
## 4  57.4    59      0            197            197            197 Caucasian
## 5  NA      70      1            164             NA            129 Caucasian
## 6  NA      45      1            380            365            365 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
test_set <- reduced_data[-train_index, ]
head(test_set)
## # A tibble: 6 × 9
##     bmi   age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
##   <dbl> <dbl>  <dbl>          <dbl>          <dbl>          <dbl> <chr>    
## 1  27.4    77      0            145            145            145 Caucasian
## 2  NA      19      1             NA             NA             NA Caucasian
## 3  27.6    67      1            156             NA            156 Caucasian
## 4  25.7    50      1            134            134            134 <NA>     
## 5  38.2    81      1            120            121            121 Caucasian
## 6  23.4    79      0            175             NA            175 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
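createDataPartition samples within each outcome class, so both halves keep roughly the same outcome prevalence. A self-contained base-R sketch of the same idea on toy labels (the prevalence and sizes here are made up):

```r
set.seed(1)

# Toy outcome: 1000 patients, 20% positive
y <- factor(rep(c(0, 1), times = c(800, 200)))

# Stratified 80/20 split: sample indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(idx) sample(idx, floor(0.8 * length(idx)))))

prop.table(table(y[train_idx]))   # ~80/20, as in the full data
prop.table(table(y[-train_idx]))  # ~80/20 in the held-out set too
```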

3. View summary statistics (“Table 1”)

allvars <- c("bmi", "age", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max","ethnicity", "wbc_apache")
catvars <- c("gender")
table_gossis <- CreateTableOne(vars = allvars, data = train_set,
                factorVars = catvars,
                strata = "outcome_variable")
kableone(table_gossis, caption = "Summary statistics of the training set")
Summary statistics of the training set
|                            | 0              | 1               | p      | test |
|:---------------------------|:---------------|:----------------|:-------|:-----|
| n                          | 56405          | 16394           |        |      |
| bmi (mean (SD))            | 28.41 (7.86)   | 31.82 (9.00)    | <0.001 |      |
| age (mean (SD))            | 61.57 (17.41)  | 64.62 (14.45)   | <0.001 |      |
| gender = 1 (%)             | 30270 (53.7)   | 9044 (55.2)     | 0.001  |      |
| glucose_apache (mean (SD)) | 141.36 (73.15) | 218.80 (112.45) | <0.001 |      |
| h1_glucose_max (mean (SD)) | 147.92 (73.63) | 214.95 (118.80) | <0.001 |      |
| d1_glucose_max (mean (SD)) | 154.69 (68.89) | 239.15 (104.81) | <0.001 |      |
| ethnicity (%)              |                |                 | <0.001 |      |
|    African American        | 5620 (10.1)    | 1977 (12.2)     |        |      |
|    Asian                   | 647 (1.2)      | 234 (1.4)       |        |      |
|    Caucasian               | 43997 (79.3)   | 12106 (74.6)    |        |      |
|    Hispanic                | 2279 (4.1)     | 748 (4.6)       |        |      |
|    Native American         | 419 (0.8)      | 218 (1.3)       |        |      |
|    Other/Unknown           | 2509 (4.5)     | 953 (5.9)       |        |      |
| wbc_apache (mean (SD))     | 12.11 (6.97)   | 12.19 (6.72)    | 0.243  |      |

4. Preprocessing the data

4.1 Impute missing values

We have to consider the NA values in our data. For simplicity, we will replace each predictor's missing values with the median of the corresponding column in train_set.

Note that it is important that we do this after splitting into training and test sets. If we imputed values using test data, we would risk leaking information about our test set into our training set (“data leakage”).

Feel free to experiment with different imputation techniques!

predictors <- c("age", "bmi", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max", "ethnicity", "wbc_apache")

for (col in predictors) {
    # median() is only well-defined for numeric columns; at this point gender
    # is already 0/1, but ethnicity is character, so we impute it with its
    # most frequent value (the mode) instead
    if (is.numeric(train_set[[col]])) {
        fill <- median(train_set[[col]], na.rm = TRUE)
    } else {
        fill <- names(which.max(table(train_set[[col]])))
    }
    test_set[[col]][is.na(test_set[[col]])] <- fill
    train_set[[col]][is.na(train_set[[col]])] <- fill
}
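One refinement worth experimenting with: keep a "was missing" indicator column alongside the imputed value, so the model can use missingness itself as a signal. A self-contained toy sketch (the values here are made up):

```r
x <- c(2.5, NA, 3.1, NA, 4.0)

# Record which entries were missing before overwriting them
x_missing <- as.integer(is.na(x))

# Then impute with the median as above
x[is.na(x)] <- median(x, na.rm = TRUE)

x          # 2.5 3.1 3.1 3.1 4.0
x_missing  # 0 1 0 1 0
```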

4.2 Normalization

For many models it is important to normalize the data. Without normalization, variables with larger magnitudes may have a greater effect on the model.

One common approach to normalization is to subtract each variable's mean and then divide by its standard deviation (z-scoring). As before, to avoid data leakage, the normalization statistics should be computed on the training set only, after splitting.

Our tree model does not require normalization, so we will skip this step.
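For reference, z-scoring with training statistics would look like the following self-contained sketch (toy vectors; note the test column reuses the train mean and SD, so no test-set information leaks in):

```r
train_x <- c(10, 12, 14, 16, 18)
test_x  <- c(11, 20)

# Statistics come from the training data only
mu  <- mean(train_x)
sdv <- sd(train_x)

# Subtract the mean, then divide by the standard deviation
train_z <- (train_x - mu) / sdv
test_z  <- (test_x - mu) / sdv  # same mu/sdv: no leakage

round(mean(train_z), 10)  # 0
sd(train_z)               # 1
```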

5. Model building

5.1 Train a random forest classifier

First, let's view the distribution of the training data:

hist.data.frame(train_set)

To try a different model, change the method argument; the full list of supported models is at http://topepo.github.io/caret/train-models-by-tag.html. For example: method = "svmPoly" or method = "glm".

# Define forest tuning parameters
# Repeated 2-fold cross-validation
control <- trainControl(method = "repeatedcv", number = 2)
# Number of variables tried at each split; sqrt(p) is a common heuristic
# (note that ncol(train_set) also counts the outcome column)
mtry <- sqrt(ncol(train_set))
# Grid search is a linear search through a vector of mtry values
tunegrid <- expand.grid(.mtry = mtry)
# Create a classification forest using the eight predictors selected above
forest <- train(outcome_variable ~ age + bmi + gender + glucose_apache + h1_glucose_max + d1_glucose_max + ethnicity + wbc_apache,
              data = train_set,
              method = "rf",
              metric = "Accuracy",
              tuneGrid = tunegrid,
              trControl = control)
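The grid above holds mtry fixed at a single value, so no real tuning happens. To actually search, supply a vector of candidate values; this is a hypothetical sketch reusing the objects defined above (forest_tuned is an illustrative name):

```r
# Try several values for the number of variables sampled at each split
tunegrid <- expand.grid(.mtry = c(2, 3, 4, 6, 8))

forest_tuned <- train(outcome_variable ~ .,
                      data = train_set,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tunegrid,
                      trControl = control)

forest_tuned$bestTune  # mtry with the best cross-validated accuracy
```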

5.2 View variable importance

  • There are many approaches to computing variable “importance”.
  • SHapley Additive exPlanations (SHAP) values are a popular alternative; they compute the marginal contribution of a feature across combinations of features.
  • Here we will use caret's varImp() (see the caret documentation).
# Calculate variable importance
importance <- varImp(forest)
kable(importance$importance)
|                          |     Overall |
|:-------------------------|------------:|
| age                      |  47.0295952 |
| bmi                      |  67.4601747 |
| gender                   |   5.8278661 |
| glucose_apache           |  80.2409013 |
| h1_glucose_max           |  44.6706070 |
| d1_glucose_max           | 100.0000000 |
| ethnicityAsian           |   0.1917897 |
| ethnicityCaucasian       |   4.1155818 |
| ethnicityHispanic        |   1.2941783 |
| ethnicityNative American |   0.0000000 |
| ethnicityOther/Unknown   |   1.4721757 |
| wbc_apache               |  50.3555599 |

We can see that d1_glucose_max and glucose_apache had the most predictive power.

5.3 Predict diabetes mellitus in “unseen patients”

Select patients from the test set

  • Let's look at some patients in our test set.
  • What diabetes status do you expect for these patients?
head(test_set, 5) %>% select(age, bmi, gender, glucose_apache, h1_glucose_max, d1_glucose_max, ethnicity, wbc_apache)
## # A tibble: 5 × 8
##     age   bmi gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
##   <dbl> <dbl>  <dbl>          <dbl>          <dbl>          <dbl> <chr>    
## 1    77  27.4      0            145            145            145 Caucasian
## 2    19  27.7      1            133            140            150 Caucasian
## 3    67  27.6      1            156            140            156 Caucasian
## 4    50  25.7      1            134            134            134 Caucasian
## 5    81  38.2      1            120            121            121 Caucasian
## # … with 1 more variable: wbc_apache <dbl>
original_test_set <- test_set

Next, let's view the distribution of the test data:

hist.data.frame(original_test_set)

allvars <- c("bmi", "age", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max","ethnicity", "wbc_apache")
catvars <- c("gender")
table_gossis <- CreateTableOne(vars = allvars, data = original_test_set,
                factorVars = catvars,
                strata = "outcome_variable")

kableone(table_gossis, caption = "Summary statistics of the test set")
Summary statistics of the test set
|                            | 0              | 1               | p      | test |
|:---------------------------|:---------------|:----------------|:-------|:-----|
| n                          | 14101          | 4098            |        |      |
| bmi (mean (SD))            | 28.37 (7.81)   | 31.70 (8.82)    | <0.001 |      |
| age (mean (SD))            | 61.85 (16.81)  | 65.08 (13.90)   | <0.001 |      |
| gender = 1 (%)             | 7543 (53.5)    | 2242 (54.7)     | 0.175  |      |
| glucose_apache (mean (SD)) | 140.80 (68.46) | 214.12 (111.57) | <0.001 |      |
| h1_glucose_max (mean (SD)) | 143.03 (45.38) | 182.37 (97.99)  | <0.001 |      |
| d1_glucose_max (mean (SD)) | 154.69 (66.49) | 236.51 (105.41) | <0.001 |      |
| ethnicity (%)              |                |                 | <0.001 |      |
|    African American        | 1388 (9.8)     | 476 (11.6)      |        |      |
|    Asian                   | 181 (1.3)      | 64 (1.6)        |        |      |
|    Caucasian               | 11261 (79.9)   | 3067 (74.8)     |        |      |
|    Hispanic                | 542 (3.8)      | 193 (4.7)       |        |      |
|    Native American         | 98 (0.7)       | 47 (1.1)        |        |      |
|    Other/Unknown           | 631 (4.5)      | 251 (6.1)       |        |      |
| wbc_apache (mean (SD))     | 11.70 (6.10)   | 11.93 (6.21)    | 0.030  |      |

Predict diabetes mellitus in unseen patients

Original results:

# Make predictions on test set

test_set <- original_test_set

forest_pred <- predict(forest,
                       newdata = test_set,
                       type = "raw")

# Combine unseen patients data with corresponding predictions
data.frame(age = test_set$age,
           bmi = test_set$bmi,
           gender = test_set$gender,
           glucose_apache = test_set$glucose_apache,
           h1_glucose_max = test_set$h1_glucose_max, 
           d1_glucose_max = test_set$d1_glucose_max,
           ethnicity = test_set$ethnicity,
           wbc_apache = test_set$wbc_apache,
           prediction = forest_pred,
           truth_value = test_set$outcome_variable) %>%
          head(30) %>%
          kable()
| age | bmi      | gender | glucose_apache | h1_glucose_max | d1_glucose_max | ethnicity        | wbc_apache | prediction | truth_value |
|----:|---------:|-------:|---------------:|---------------:|---------------:|:-----------------|-----------:|:-----------|:------------|
|  77 | 27.42000 | 0 | 145.0 | 145.000 | 145 | Caucasian        | 12.7 | 0 | 1 |
|  19 | 27.65432 | 1 | 133.0 | 140.000 | 150 | Caucasian        | 10.4 | 0 | 0 |
|  67 | 27.56000 | 1 | 156.0 | 140.000 | 156 | Caucasian        | 10.9 | 0 | 1 |
|  50 | 25.71000 | 1 | 134.0 | 134.000 | 134 | Caucasian        |  8.4 | 0 | 0 |
|  81 | 38.18907 | 1 | 120.0 | 121.000 | 121 | Caucasian        |  6.9 | 0 | 0 |
|  79 | 23.40898 | 0 | 175.0 | 140.000 | 175 | Caucasian        | 13.5 | 0 | 0 |
|  65 | 27.65432 | 0 | 205.0 | 140.000 | 208 | Caucasian        | 10.1 | 0 | 1 |
|  60 | 26.48571 | 1 | 149.0 | 140.000 | 149 | African American |  7.5 | 0 | 0 |
|  71 | 34.30569 | 0 | 157.0 | 140.000 | 157 | Caucasian        |  5.4 | 0 | 0 |
|  58 | 27.65432 | 0 |  88.0 | 140.000 |  88 | Caucasian        |  6.7 | 0 | 0 |
|  85 | 21.88964 | 0 |  81.0 | 140.000 |  81 | Caucasian        |  6.0 | 0 | 0 |
|  65 | 26.28642 | 1 | 361.0 | 140.000 | 361 | African American | 10.4 | 1 | 0 |
|  82 | 23.73812 | 1 | 121.0 | 140.000 | 121 | Caucasian        |  8.5 | 0 | 0 |
|  64 | 16.98039 | 1 |  96.0 | 140.000 | 113 | Caucasian        | 21.8 | 0 | 0 |
|  59 | 14.84493 | 0 |  71.0 | 140.000 | 182 | African American |  1.8 | 0 | 0 |
|  55 | 32.20536 | 1 |  81.0 |  95.000 |  95 | Caucasian        |  8.2 | 0 | 0 |
|  62 | 27.26740 | 1 |  72.0 |  87.000 |  87 | Caucasian        |  3.5 | 0 | 0 |
|  56 | 34.15440 | 0 | 133.0 | 140.000 | 150 | Caucasian        | 10.4 | 0 | 0 |
|  73 | 20.35197 | 0 | 159.0 | 140.000 | 159 | Caucasian        | 15.6 | 0 | 0 |
|  48 | 26.51648 | 1 |  96.0 | 140.000 |  96 | Caucasian        |  5.6 | 0 | 0 |
|  65 | 33.00135 | 0 |  94.0 | 140.000 |  94 | Caucasian        | 10.4 | 0 | 0 |
|  55 | 24.12382 | 1 | 114.0 | 140.000 | 114 | Caucasian        |  4.1 | 0 | 0 |
|  71 | 35.07812 | 0 | 181.0 | 140.000 | 181 | Caucasian        | 12.1 | 0 | 1 |
|  47 | 28.86719 | 0 |  76.0 |  82.000 | 121 | Asian            |  4.7 | 0 | 0 |
|  75 | 27.65432 | 0 | 598.7 | 695.045 | 611 | Caucasian        | 11.4 | 1 | 1 |
|  55 | 23.74531 | 0 | 160.0 | 192.000 | 192 | African American | 16.2 | 0 | 0 |
|  74 | 27.65432 | 0 | 324.0 | 140.000 | 324 | Caucasian        | 18.0 | 1 | 1 |
|  68 | 18.45329 | 0 | 260.0 | 291.000 | 291 | Caucasian        | 27.6 | 0 | 0 |
|  89 | 27.25875 | 1 | 106.0 | 140.000 | 106 | Caucasian        | 15.4 | 0 | 0 |
|  69 | 34.75311 | 0 | 133.0 | 140.000 | 145 | Caucasian        | 10.4 | 0 | 0 |
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
# Keep the probability of the positive class ("1")
forest_probs <- forest_probs[, "1"]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10))
abline(a = 0, b = 1)

# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]

# Ground truth is recorded in the GOSSIS data
cm <- confusionMatrix(forest_pred,
                      test_set$outcome_variable,
                      positive = "1")$table


Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y      <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)

ggplot(data =  df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle("Overall prediction")+
  geom_tile(aes(fill = Y), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface  = "bold", alpha = 1) +
  scale_fill_gradient(low="white", high="#009194") +
  theme_bw() + theme(legend.position = "none")+ coord_fixed()

# In caret's confusionMatrix() table, rows are predictions and columns are
# the reference, so cm[1] = TN, cm[2] = FP, cm[3] = FN, cm[4] = TP
accuracy <- (cm[1] + cm[4]) / (cm[1] + cm[2] + cm[3] + cm[4])
recall <- cm[4] / (cm[4] + cm[3])       # TP / (TP + FN), i.e. sensitivity
specificity <- cm[1] / (cm[1] + cm[2])  # TN / (TN + FP)

print("auc:")
## [1] "auc:"
print(auc_forest)
## [1] 0.8130455
print("accuracy:")
## [1] "accuracy:"
print(accuracy)
## [1] 0.8068026
print("recall:")
## [1] "recall:"
print(recall)
## [1] 0.386774
print("specificity:")
## [1] "specificity:"
print(specificity)
## [1] 0.9288703
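As a sanity check, caret's confusionMatrix() reports these statistics itself; a hypothetical snippet assuming the objects above:

```r
cm_full <- confusionMatrix(forest_pred,
                           test_set$outcome_variable,
                           positive = "1")

# "Sensitivity" here is the same quantity as recall above
cm_full$byClass[c("Sensitivity", "Specificity")]
cm_full$overall["Accuracy"]
```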

Create a data frame to store the subgroup results:

df_result <- data.frame(name = character(), auc = double(),
                        accuracy = double(), recall = double(),
                        specificity = double(), range = character())

Subgroup analysis (discrete variables):

library(ggplot2)

var_list <- c("gender", "ethnicity")


for (var in var_list) {
  print(var)

  groups <- unique(original_test_set[[var]])

  for (i in groups) {
    test_set <- original_test_set[original_test_set[[var]] == i, ]
    num <- nrow(test_set)

    # Make predictions on this subgroup of the test set
    forest_pred <- predict(forest,
                           newdata = test_set,
                           type = "raw")

    # Get the probabilities of the positive class for our forest predictions
    forest_probs <- predict(forest, newdata = test_set, type = "prob")
    forest_probs <- forest_probs[, "1"]

    # Create a "prediction" object using our probabilities and the outcome variable
    forest_roc <- prediction(forest_probs, test_set$outcome_variable)
    forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")

    # Plot the ROC curve
    plot(forest_perf, col = rainbow(10), main = paste(var, ":", i, "(", num, ")"))
    abline(a = 0, b = 1)

    # Calculate the AUC
    auc_forest <- performance(forest_roc, measure = "auc")
    auc_forest <- auc_forest@y.values[[1]]

    # Ground truth is recorded in the GOSSIS data
    cm <- confusionMatrix(forest_pred,
                          test_set$outcome_variable,
                          positive = "1")$table

    Reference <- factor(c(0, 0, 1, 1))
    Prediction <- factor(c(0, 1, 0, 1))
    Y <- c(cm[1], cm[2], cm[3], cm[4])
    df_2 <- data.frame(Reference, Prediction, Y)

    print(ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) +
      ggtitle(paste(var, ":", i, "(", num, ")")) +
      geom_tile(aes(fill = Y), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
      scale_fill_gradient(low = "white", high = "#009194") +
      theme_bw() + theme(legend.position = "none") + coord_fixed())

    # cm[1] = TN, cm[2] = FP, cm[3] = FN, cm[4] = TP
    accuracy <- (cm[1] + cm[4]) / (cm[1] + cm[2] + cm[3] + cm[4])
    recall <- cm[4] / (cm[4] + cm[3])       # TP / (TP + FN)
    specificity <- cm[1] / (cm[1] + cm[2])  # TN / (TN + FP)

    df_result <- rbind(df_result,
                       data.frame(name = var, auc = auc_forest, accuracy = accuracy,
                                  recall = recall, specificity = specificity, range = i))

    print(paste(i, "(", num, "):"))
    print("auc:")
    print(auc_forest)
    print("accuracy:")
    print(accuracy)
    print("recall:")
    print(recall)
    print("specificity:")
    print(specificity)
  }
}
## [1] "gender"

## [1] "0 ( 8414 ):"
## [1] "auc:"
## [1] 0.8209621
## [1] "accuracy:"
## [1] 0.8168529
## [1] "recall:"
## [1] 0.6338148
## [1] "specificity:"
## [1] 0.8466215

## [1] "1 ( 9785 ):"
## [1] "auc:"
## [1] 0.8059802
## [1] "accuracy:"
## [1] 0.7985692
## [1] "recall:"
## [1] 0.5961675
## [1] "specificity:"
## [1] 0.832617
## [1] "ethnicity"

## [1] "Caucasian ( 14328 ):"
## [1] "auc:"
## [1] 0.8179748
## [1] "accuracy:"
## [1] 0.8161642
## [1] "recall:"
## [1] 0.6157135
## [1] "specificity:"
## [1] 0.8462712

## [1] "African American ( 1864 ):"
## [1] "auc:"
## [1] 0.777936
## [1] "accuracy:"
## [1] 0.7757511
## [1] "recall:"
## [1] 0.5917722
## [1] "specificity:"
## [1] 0.8133075

## [1] "Asian ( 245 ):"
## [1] "auc:"
## [1] 0.7922566
## [1] "accuracy:"
## [1] 0.7632653
## [1] "recall:"
## [1] 0.5833333
## [1] "specificity:"
## [1] 0.7942584

## [1] "Hispanic ( 735 ):"
## [1] "auc:"
## [1] 0.814394
## [1] "accuracy:"
## [1] 0.7768707
## [1] "recall:"
## [1] 0.5986395
## [1] "specificity:"
## [1] 0.8214286

## [1] "Native American ( 145 ):"
## [1] "auc:"
## [1] 0.8129614
## [1] "accuracy:"
## [1] 0.737931
## [1] "recall:"
## [1] 0.6285714
## [1] "specificity:"
## [1] 0.7727273

## [1] "Other/Unknown ( 882 ):"
## [1] "auc:"
## [1] 0.8015829
## [1] "accuracy:"
## [1] 0.7721088
## [1] "recall:"
## [1] 0.6373626
## [1] "specificity:"
## [1] 0.8071429

Subgroup analysis (continuous variables):

df <- data.frame(name = "age", range = c(20, 40, 60, 80))
df <- rbind(df, data.frame(name = "bmi", range = c(20, 30, 40)))
df <- rbind(df, data.frame(name = "glucose_apache", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "h1_glucose_max", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "d1_glucose_max", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "wbc_apache", range = c(5, 10, 15, 20)))
var_list <- c("age","bmi","glucose_apache", "h1_glucose_max", "d1_glucose_max", "wbc_apache")

for (var in var_list) {
  print(var)
  var_range <- df[df$name == var, "range"]

  # Bins: below the first cut point, between consecutive cut points, above the last.
  # Note: values exactly equal to a cut point are excluded by the strict inequalities.
  for (i in 1:(length(var_range) + 1)) {

    if (i == 1) {
      test_set <- original_test_set[original_test_set[[var]] < var_range[i], ]
    } else if (i == length(var_range) + 1) {
      test_set <- original_test_set[original_test_set[[var]] > var_range[i - 1], ]
    } else {
      test_set <- original_test_set[original_test_set[[var]] > var_range[i - 1] &
                                    original_test_set[[var]] < var_range[i], ]
    }

    num <- nrow(test_set)

    if (i == 1) {
      range_text <- paste("[", i, "]", "<", var_range[i], "(", num, ")")
    } else if (i == length(var_range) + 1) {
      range_text <- paste("[", i, "]", ">", var_range[i - 1], "(", num, ")")
    } else {
      range_text <- paste("[", i, "]", var_range[i - 1], "~", var_range[i], "(", num, ")")
    }

    # Make predictions on this subgroup of the test set
    forest_pred <- predict(forest,
                           newdata = test_set,
                           type = "raw")

    # Get the probabilities of the positive class for our forest predictions
    forest_probs <- predict(forest, newdata = test_set, type = "prob")
    forest_probs <- forest_probs[, "1"]

    # Create a "prediction" object using our probabilities and the outcome variable
    forest_roc <- prediction(forest_probs, test_set$outcome_variable)
    forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")

    # Plot the ROC curve
    plot(forest_perf, col = rainbow(10), main = paste(var, ":", range_text))
    abline(a = 0, b = 1)

    # Calculate the AUC
    auc_forest <- performance(forest_roc, measure = "auc")
    auc_forest <- auc_forest@y.values[[1]]

    print(paste(range_text, ":"))

    # Ground truth is recorded in the GOSSIS data
    cm <- confusionMatrix(forest_pred,
                          test_set$outcome_variable,
                          positive = "1")$table

    Reference <- factor(c(0, 0, 1, 1))
    Prediction <- factor(c(0, 1, 0, 1))
    Y <- c(cm[1], cm[2], cm[3], cm[4])
    df_2 <- data.frame(Reference, Prediction, Y)

    print(ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) +
      ggtitle(range_text) +
      geom_tile(aes(fill = Y), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
      scale_fill_gradient(low = "white", high = "#009194") +
      theme_bw() + theme(legend.position = "none") + coord_fixed())

    # cm[1] = TN, cm[2] = FP, cm[3] = FN, cm[4] = TP
    accuracy <- (cm[1] + cm[4]) / (cm[1] + cm[2] + cm[3] + cm[4])
    recall <- cm[4] / (cm[4] + cm[3])       # TP / (TP + FN)
    specificity <- cm[1] / (cm[1] + cm[2])  # TN / (TN + FP)

    df_result <- rbind(df_result,
                       data.frame(name = var, auc = auc_forest, accuracy = accuracy,
                                  recall = recall, specificity = specificity, range = range_text))

    print("auc:")
    print(auc_forest)
    print("accuracy:")
    print(accuracy)
    print("recall:")
    print(recall)
    print("specificity:")
    print(specificity)
  }
}
## [1] "age"

## [1] "[ 1 ] < 20 ( 160 ) :"

## [1] "auc:"
## [1] 0.9911937
## [1] "accuracy:"
## [1] 0.9625
## [1] "recall:"
## [1] 0.7
## [1] "specificity:"
## [1] 1

## [1] "[ 2 ] 20 ~ 40 ( 1653 ) :"

## [1] "auc:"
## [1] 0.9303992
## [1] "accuracy:"
## [1] 0.9122807
## [1] "recall:"
## [1] 0.6790123
## [1] "specificity:"
## [1] 0.9376258

## [1] "[ 3 ] 40 ~ 60 ( 4691 ) :"

## [1] "auc:"
## [1] 0.8298128
## [1] "accuracy:"
## [1] 0.8166702
## [1] "recall:"
## [1] 0.5799419
## [1] "specificity:"
## [1] 0.857357

## [1] "[ 4 ] 60 ~ 80 ( 8360 ) :"

## [1] "auc:"
## [1] 0.7820416
## [1] "accuracy:"
## [1] 0.7767943
## [1] "recall:"
## [1] 0.6246334
## [1] "specificity:"
## [1] 0.8064608

## [1] "[ 5 ] > 80 ( 2420 ) :"

## [1] "auc:"
## [1] 0.773025
## [1] "accuracy:"
## [1] 0.8144628
## [1] "recall:"
## [1] 0.6103896
## [1] "specificity:"
## [1] 0.8359982
## [1] "bmi"

## [1] "[ 1 ] < 20 ( 1426 ) :"

## [1] "auc:"
## [1] 0.834415
## [1] "accuracy:"
## [1] 0.8870968
## [1] "recall:"
## [1] 0.5042735
## [1] "specificity:"
## [1] 0.921314

## [1] "[ 2 ] 20 ~ 30 ( 10204 ) :"

## [1] "auc:"
## [1] 0.815502
## [1] "accuracy:"
## [1] 0.8360447
## [1] "recall:"
## [1] 0.6160991
## [1] "specificity:"
## [1] 0.8591229

## [1] "[ 3 ] 30 ~ 40 ( 4930 ) :"

## [1] "auc:"
## [1] 0.7814041
## [1] "accuracy:"
## [1] 0.7561866
## [1] "recall:"
## [1] 0.6107249
## [1] "specificity:"
## [1] 0.7935254

## [1] "[ 4 ] > 40 ( 1636 ) :"

## [1] "auc:"
## [1] 0.7548043
## [1] "accuracy:"
## [1] 0.7114914
## [1] "recall:"
## [1] 0.6424242
## [1] "specificity:"
## [1] 0.7414549
## [1] "glucose_apache"

## [1] "[ 1 ] < 100 ( 4572 ) :"

## [1] "auc:"
## [1] 0.7937578
## [1] "accuracy:"
## [1] 0.861986
## [1] "recall:"
## [1] 0.6477273
## [1] "specificity:"
## [1] 0.8661909

## [1] "[ 2 ] 100 ~ 200 ( 9650 ) :"

## [1] "auc:"
## [1] 0.753888
## [1] "accuracy:"
## [1] 0.8596891
## [1] "recall:"
## [1] 0.5737705
## [1] "specificity:"
## [1] 0.8671061

## [1] "[ 3 ] 200 ~ 300 ( 2503 ) :"

## [1] "auc:"
## [1] 0.6470318
## [1] "accuracy:"
## [1] 0.6048742
## [1] "recall:"
## [1] 0.619211
## [1] "specificity:"
## [1] 0.592371

## [1] "[ 4 ] > 300 ( 1269 ) :"

## [1] "auc:"
## [1] 0.5876709
## [1] "accuracy:"
## [1] 0.6052009
## [1] "recall:"
## [1] 0.6152399
## [1] "specificity:"
## [1] 0.5533981
## [1] "h1_glucose_max"

## [1] "[ 1 ] < 100 ( 1193 ) :"

## [1] "auc:"
## [1] 0.8047401
## [1] "accuracy:"
## [1] 0.8365465
## [1] "recall:"
## [1] 0.6027397
## [1] "specificity:"
## [1] 0.8517857

## [1] "[ 2 ] 100 ~ 200 ( 15255 ) :"

## [1] "auc:"
## [1] 0.7951695
## [1] "accuracy:"
## [1] 0.8254998
## [1] "recall:"
## [1] 0.582517
## [1] "specificity:"
## [1] 0.8486502

## [1] "[ 3 ] 200 ~ 300 ( 1029 ) :"

## [1] "auc:"
## [1] 0.626052
## [1] "accuracy:"
## [1] 0.5986395
## [1] "recall:"
## [1] 0.6352941
## [1] "specificity:"
## [1] 0.5483871

## [1] "[ 4 ] > 300 ( 636 ) :"

## [1] "auc:"
## [1] 0.5722972
## [1] "accuracy:"
## [1] 0.6509434
## [1] "recall:"
## [1] 0.6672504
## [1] "specificity:"
## [1] 0.5076923
## [1] "d1_glucose_max"

## [1] "[ 1 ] < 100 ( 1889 ) :"

## [1] "auc:"
## [1] 0.7760518
## [1] "accuracy:"
## [1] 0.9534145
## [1] "recall:"
## [1] 1
## [1] "specificity:"
## [1] 0.9533898

## [1] "[ 2 ] 100 ~ 200 ( 11849 ) :"

## [1] "auc:"
## [1] 0.7331695
## [1] "accuracy:"
## [1] 0.8575407
## [1] "recall:"
## [1] 0.5461538
## [1] "specificity:"
## [1] 0.860995

## [1] "[ 3 ] 200 ~ 300 ( 2830 ) :"

## [1] "auc:"
## [1] 0.6437554
## [1] "accuracy:"
## [1] 0.5968198
## [1] "recall:"
## [1] 0.609736
## [1] "specificity:"
## [1] 0.5871446

## [1] "[ 4 ] > 300 ( 1456 ) :"

## [1] "auc:"
## [1] 0.5759749
## [1] "accuracy:"
## [1] 0.6112637
## [1] "recall:"
## [1] 0.625817
## [1] "specificity:"
## [1] 0.5344828
## [1] "wbc_apache"

## [1] "[ 1 ] < 5 ( 994 ) :"

## [1] "auc:"
## [1] 0.7947097
## [1] "accuracy:"
## [1] 0.8138833
## [1] "recall:"
## [1] 0.6168224
## [1] "specificity:"
## [1] 0.837655

## [1] "[ 2 ] 5 ~ 10 ( 5506 ) :"

## [1] "auc:"
## [1] 0.8156433
## [1] "accuracy:"
## [1] 0.806393
## [1] "recall:"
## [1] 0.6215236
## [1] "specificity:"
## [1] 0.8390682

## [1] "[ 3 ] 10 ~ 15 ( 7902 ) :"

## [1] "auc:"
## [1] 0.8221807
## [1] "accuracy:"
## [1] 0.8171349
## [1] "recall:"
## [1] 0.6213848
## [1] "specificity:"
## [1] 0.8501701

## [1] "[ 4 ] 15 ~ 20 ( 1960 ) :"

## [1] "auc:"
## [1] 0.792613
## [1] "accuracy:"
## [1] 0.7897959
## [1] "recall:"
## [1] 0.57
## [1] "specificity:"
## [1] 0.8295181

## [1] "[ 5 ] > 20 ( 1602 ) :"

## [1] "auc:"
## [1] 0.794275
## [1] "accuracy:"
## [1] 0.7734082
## [1] "recall:"
## [1] 0.5804598
## [1] "specificity:"
## [1] 0.7969188
var_list <- c("gender","ethnicity","age","bmi","glucose_apache", "h1_glucose_max", "d1_glucose_max", "wbc_apache")

for (var in var_list) {
  tmp <- df_result %>% filter(name == var)
  print(ggplot(tmp, aes(range, auc)) +
    xlab(var) + ylab("AUC") +
    geom_bar(stat = "identity"))

  print("Gap:")
  print(max(tmp$auc) - min(tmp$auc))
}

## [1] "Gap:"
## [1] 0.01498192

## [1] "Gap:"
## [1] 0.04003881

## [1] "Gap:"
## [1] 0.2181687

## [1] "Gap:"
## [1] 0.07961078

## [1] "Gap:"
## [1] 0.2060869

## [1] "Gap:"
## [1] 0.2324429

## [1] "Gap:"
## [1] 0.2000769

## [1] "Gap:"
## [1] 0.02956778
for (var in var_list) {
  tmp <- df_result %>% filter(name == var)
  print(ggplot(tmp, aes(range, accuracy)) +
    xlab(var) + ylab("Accuracy") +
    geom_bar(stat = "identity"))

  print("Gap:")
  print(max(tmp$accuracy) - min(tmp$accuracy))
}

## [1] "Gap:"
## [1] 0.01828363

## [1] "Gap:"
## [1] 0.07823312

## [1] "Gap:"
## [1] 0.1857057

## [1] "Gap:"
## [1] 0.1756053

## [1] "Gap:"
## [1] 0.2571119

## [1] "Gap:"
## [1] 0.2379071

## [1] "Gap:"
## [1] 0.3565947

## [1] "Gap:"
## [1] 0.04372666
for (var in var_list) {
  tmp <- df_result %>% filter(name == var)
  print(ggplot(tmp, aes(range, recall)) +
    xlab(var) + ylab("Recall") +
    geom_bar(stat = "identity"))

  print("Gap:")
  print(max(tmp$recall) - min(tmp$recall))
}

## [1] "Gap:"
## [1] 0.03764729

## [1] "Gap:"
## [1] 0.0540293

## [1] "Gap:"
## [1] 0.1200581

## [1] "Gap:"
## [1] 0.1381507

## [1] "Gap:"
## [1] 0.07395678

## [1] "Gap:"
## [1] 0.08473348

## [1] "Gap:"
## [1] 0.4538462

## [1] "Gap:"
## [1] 0.05152358
for (var in var_list) {
  tmp <- df_result %>% filter(name == var)
  print(ggplot(tmp, aes(range, specificity)) +
    xlab(var) + ylab("Specificity") +
    geom_bar(stat = "identity"))

  print("Gap:")
  print(max(tmp$specificity) - min(tmp$specificity))
}

## [1] "Gap:"
## [1] 0.01400453

## [1] "Gap:"
## [1] 0.0735439

## [1] "Gap:"
## [1] 0.1935392

## [1] "Gap:"
## [1] 0.1798591

## [1] "Gap:"
## [1] 0.313708

## [1] "Gap:"
## [1] 0.3440934

## [1] "Gap:"
## [1] 0.4189071

## [1] "Gap:"
## [1] 0.05325133